This project delves into the captivating world of the FIFA World Cup using an extensive dataset sourced from the FIFA World Cup Archive. The dataset, available at the following link, encompasses information about every World Cup held annually from 1930 up to 2014 (except during 1942 and 1946 due to the Second World War). It consists of three primary components:
This project has two main objectives: to gain insights through SQL analysis and present these insights through visualizations. By using SQL queries, the aim is to uncover interesting stories, trends, and facts about the World Cup. Answering questions about team performances, match dynamics, and standout players.
To make these insights accessible and engaging, data visualization techniques will be used. The visualizations will showcase match outcomes, team achievements, and player contributions in an easy-to-understand manner.
#!pip install psycopg2-binary
import pandas as pd
from IPython.display import Image, display
from PIL import Image as PILImage
from sqlalchemy import create_engine, text
import psycopg2
import plotly.express as px
import plotly.graph_objects as go
import dash
from dash import dcc, html
from plotly.subplots import make_subplots
import plotly.io as pio
#Let's start by importing the csv files
world_cup = pd.read_csv("WorldCups.csv")
matches = pd.read_csv("WorldCupMatches.csv")
players = pd.read_csv("WorldCupPlayers.csv")
Next, we will take a quick look at each dataset
#First rows of World Cup dataset
world_cup.head()
| Year | Country | Winner | Runners-Up | Third | Fourth | GoalsScored | QualifiedTeams | MatchesPlayed | Attendance | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1930 | Uruguay | Uruguay | Argentina | USA | Yugoslavia | 70 | 13 | 18 | 590.549 |
| 1 | 1934 | Italy | Italy | Czechoslovakia | Germany | Austria | 70 | 16 | 17 | 363.000 |
| 2 | 1938 | France | Italy | Hungary | Brazil | Sweden | 84 | 15 | 18 | 375.700 |
| 3 | 1950 | Brazil | Uruguay | Brazil | Sweden | Spain | 88 | 13 | 22 | 1.045.246 |
| 4 | 1954 | Switzerland | Germany FR | Hungary | Austria | Uruguay | 140 | 16 | 26 | 768.607 |
#Last rows of World Cup dataset
world_cup.tail()
| Year | Country | Winner | Runners-Up | Third | Fourth | GoalsScored | QualifiedTeams | MatchesPlayed | Attendance | |
|---|---|---|---|---|---|---|---|---|---|---|
| 15 | 1998 | France | France | Brazil | Croatia | Netherlands | 171 | 32 | 64 | 2.785.100 |
| 16 | 2002 | Korea/Japan | Brazil | Germany | Turkey | Korea Republic | 161 | 32 | 64 | 2.705.197 |
| 17 | 2006 | Germany | Italy | France | Germany | Portugal | 147 | 32 | 64 | 3.359.439 |
| 18 | 2010 | South Africa | Spain | Netherlands | Germany | Uruguay | 145 | 32 | 64 | 3.178.856 |
| 19 | 2014 | Brazil | Germany | Argentina | Netherlands | Brazil | 171 | 32 | 64 | 3.386.810 |
#NAs in the World Cup dataset
missing_values_wc = world_cup.isna().sum()
print(missing_values_wc)
Year 0 Country 0 Winner 0 Runners-Up 0 Third 0 Fourth 0 GoalsScored 0 QualifiedTeams 0 MatchesPlayed 0 Attendance 0 dtype: int64
# Tranform Attendance into integer (remove dots)
world_cup['Attendance'] = world_cup['Attendance'].str.replace('.', '').astype(int)
/var/folders/h9/t189rfrn65l6ph7vmp0hyf7r0000gn/T/ipykernel_24421/3236473007.py:2: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.
world_cup['Attendance'] = world_cup['Attendance'].str.replace('.', '').astype(int)
# Transform column names to lowercase and remove special characters (avoid problems when populating tables into the database)
world_cup.columns = world_cup.columns.str.replace(r'\W+', '_', regex=True).str.lower()
world_cup.head()
| year | country | winner | runners_up | third | fourth | goalsscored | qualifiedteams | matchesplayed | attendance | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1930 | Uruguay | Uruguay | Argentina | USA | Yugoslavia | 70 | 13 | 18 | 590549 |
| 1 | 1934 | Italy | Italy | Czechoslovakia | Germany | Austria | 70 | 16 | 17 | 363000 |
| 2 | 1938 | France | Italy | Hungary | Brazil | Sweden | 84 | 15 | 18 | 375700 |
| 3 | 1950 | Brazil | Uruguay | Brazil | Sweden | Spain | 88 | 13 | 22 | 1045246 |
| 4 | 1954 | Switzerland | Germany FR | Hungary | Austria | Uruguay | 140 | 16 | 26 | 768607 |
#First rows of Matches dataset
matches.head()
| Year | Datetime | Stage | Stadium | City | Home Team Name | Home Team Goals | Away Team Goals | Away Team Name | Win conditions | Attendance | Half-time Home Goals | Half-time Away Goals | Referee | Assistant 1 | Assistant 2 | RoundID | MatchID | Home Team Initials | Away Team Initials | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1930.0 | 13 Jul 1930 - 15:00 | Group 1 | Pocitos | Montevideo | France | 4.0 | 1.0 | Mexico | 4444.0 | 3.0 | 0.0 | LOMBARDI Domingo (URU) | CRISTOPHE Henry (BEL) | REGO Gilberto (BRA) | 201.0 | 1096.0 | FRA | MEX | |
| 1 | 1930.0 | 13 Jul 1930 - 15:00 | Group 4 | Parque Central | Montevideo | USA | 3.0 | 0.0 | Belgium | 18346.0 | 2.0 | 0.0 | MACIAS Jose (ARG) | MATEUCCI Francisco (URU) | WARNKEN Alberto (CHI) | 201.0 | 1090.0 | USA | BEL | |
| 2 | 1930.0 | 14 Jul 1930 - 12:45 | Group 2 | Parque Central | Montevideo | Yugoslavia | 2.0 | 1.0 | Brazil | 24059.0 | 2.0 | 0.0 | TEJADA Anibal (URU) | VALLARINO Ricardo (URU) | BALWAY Thomas (FRA) | 201.0 | 1093.0 | YUG | BRA | |
| 3 | 1930.0 | 14 Jul 1930 - 14:50 | Group 3 | Pocitos | Montevideo | Romania | 3.0 | 1.0 | Peru | 2549.0 | 1.0 | 0.0 | WARNKEN Alberto (CHI) | LANGENUS Jean (BEL) | MATEUCCI Francisco (URU) | 201.0 | 1098.0 | ROU | PER | |
| 4 | 1930.0 | 15 Jul 1930 - 16:00 | Group 1 | Parque Central | Montevideo | Argentina | 1.0 | 0.0 | France | 23409.0 | 0.0 | 0.0 | REGO Gilberto (BRA) | SAUCEDO Ulises (BOL) | RADULESCU Constantin (ROU) | 201.0 | 1085.0 | ARG | FRA |
#Last rows of Matches dataset
matches.tail()
| Year | Datetime | Stage | Stadium | City | Home Team Name | Home Team Goals | Away Team Goals | Away Team Name | Win conditions | Attendance | Half-time Home Goals | Half-time Away Goals | Referee | Assistant 1 | Assistant 2 | RoundID | MatchID | Home Team Initials | Away Team Initials | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4567 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4568 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4569 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4570 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4571 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
#NAs in the Matches dataset
missing_values_matches = matches.isna().sum()
print(missing_values_matches)
Year 3720 Datetime 3720 Stage 3720 Stadium 3720 City 3720 Home Team Name 3720 Home Team Goals 3720 Away Team Goals 3720 Away Team Name 3720 Win conditions 3720 Attendance 3722 Half-time Home Goals 3720 Half-time Away Goals 3720 Referee 3720 Assistant 1 3720 Assistant 2 3720 RoundID 3720 MatchID 3720 Home Team Initials 3720 Away Team Initials 3720 dtype: int64
#Drop records where Year is missing
matches.dropna(subset=['Year'], inplace=True)
#Which countries have the most World Cup participations
matches['Home Team Name'].value_counts()
Brazil 82
Italy 57
Argentina 54
Germany FR 43
England 35
..
Wales 1
Norway 1
rn">United Arab Emirates 1
Haiti 1
rn">Bosnia and Herzegovina 1
Name: Home Team Name, Length: 78, dtype: int64
Some countries have incorrect formatting containing rn". We will be cleaning this
#Which stadiums have seen the most World CUp matches
matches['Stadium'].value_counts()
Estadio Azteca 19
Jalisco 14
Olympiastadion 14
Nou Camp - Estadio Le�n 11
Estadio Centenario 10
..
Comunale di Cornaredo 1
Rimnersvallen 1
Tunavallen 1
White City 1
Eyravallen 1
Name: Stadium, Length: 181, dtype: int64
# Identify stadiums with special characters that are not read correctly
matches[matches['Stadium'].str.contains('�', na=False)]['Stadium'].value_counts()
Nou Camp - Estadio Le�n 11 Stade V�lodrome 9 Maracan� - Est�dio Jornalista M�rio Filho 8 Estadio Ol�mpico Chateau Carreras 8 Estadio Jos� Mar�a Minella 6 Estadio Ol�mpico Universitario 4 Estadio Municipal de Bala�dos 3 Name: Stadium, dtype: int64
# Check for duplicates based on the 'matchid' column
duplicates = matches[matches.duplicated(subset=['MatchID'], keep=False)]
# Drop duplicates based on the 'matchid' column
matches = matches.drop_duplicates(subset=['MatchID'])
# Corrections for stadiums with special characters
corrections_stadiums = {
'Nou Camp - Estadio Le�n': 'Nou Camp - Estadio Leon',
'Stade V�lodrome': 'Stade Velodrome',
'Maracan� - Est�dio Jornalista M�rio Filho': 'Maracana - Estadio Jornalista Mario Filho',
'Estadio Ol�mpico Chateau Carreras': 'Estadio Olimpico Chateau Carreras',
'Estadio Jos� Mar�a Minella': 'Estadio Jose Maria Minella',
'Estadio Ol�mpico Universitario': 'Estadio Olimpico Universitario',
'Estadio Municipal de Bala�dos': 'Estadio Municipal de Balaidos'
}
# Apply corrections to the 'Stadium' column
matches['Stadium'] = matches['Stadium'].replace(corrections_stadiums)
#Remove decimal from year column
matches['Year'] = matches['Year'].astype(int)
matches.head()
| Year | Datetime | Stage | Stadium | City | Home Team Name | Home Team Goals | Away Team Goals | Away Team Name | Win conditions | Attendance | Half-time Home Goals | Half-time Away Goals | Referee | Assistant 1 | Assistant 2 | RoundID | MatchID | Home Team Initials | Away Team Initials | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1930 | 13 Jul 1930 - 15:00 | Group 1 | Pocitos | Montevideo | France | 4.0 | 1.0 | Mexico | 4444.0 | 3.0 | 0.0 | LOMBARDI Domingo (URU) | CRISTOPHE Henry (BEL) | REGO Gilberto (BRA) | 201.0 | 1096.0 | FRA | MEX | |
| 1 | 1930 | 13 Jul 1930 - 15:00 | Group 4 | Parque Central | Montevideo | USA | 3.0 | 0.0 | Belgium | 18346.0 | 2.0 | 0.0 | MACIAS Jose (ARG) | MATEUCCI Francisco (URU) | WARNKEN Alberto (CHI) | 201.0 | 1090.0 | USA | BEL | |
| 2 | 1930 | 14 Jul 1930 - 12:45 | Group 2 | Parque Central | Montevideo | Yugoslavia | 2.0 | 1.0 | Brazil | 24059.0 | 2.0 | 0.0 | TEJADA Anibal (URU) | VALLARINO Ricardo (URU) | BALWAY Thomas (FRA) | 201.0 | 1093.0 | YUG | BRA | |
| 3 | 1930 | 14 Jul 1930 - 14:50 | Group 3 | Pocitos | Montevideo | Romania | 3.0 | 1.0 | Peru | 2549.0 | 1.0 | 0.0 | WARNKEN Alberto (CHI) | LANGENUS Jean (BEL) | MATEUCCI Francisco (URU) | 201.0 | 1098.0 | ROU | PER | |
| 4 | 1930 | 15 Jul 1930 - 16:00 | Group 1 | Parque Central | Montevideo | Argentina | 1.0 | 0.0 | France | 23409.0 | 0.0 | 0.0 | REGO Gilberto (BRA) | SAUCEDO Ulises (BOL) | RADULESCU Constantin (ROU) | 201.0 | 1085.0 | ARG | FRA |
# Transform column names to lowercase and remove special characters (avoid problems when populating tables into the database)
matches.columns = matches.columns.str.replace(r'\W+', '_', regex=True).str.lower()
matches.head()
| year | datetime | stage | stadium | city | home_team_name | home_team_goals | away_team_goals | away_team_name | win_conditions | attendance | half_time_home_goals | half_time_away_goals | referee | assistant_1 | assistant_2 | roundid | matchid | home_team_initials | away_team_initials | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1930 | 13 Jul 1930 - 15:00 | Group 1 | Pocitos | Montevideo | France | 4.0 | 1.0 | Mexico | 4444.0 | 3.0 | 0.0 | LOMBARDI Domingo (URU) | CRISTOPHE Henry (BEL) | REGO Gilberto (BRA) | 201.0 | 1096.0 | FRA | MEX | |
| 1 | 1930 | 13 Jul 1930 - 15:00 | Group 4 | Parque Central | Montevideo | USA | 3.0 | 0.0 | Belgium | 18346.0 | 2.0 | 0.0 | MACIAS Jose (ARG) | MATEUCCI Francisco (URU) | WARNKEN Alberto (CHI) | 201.0 | 1090.0 | USA | BEL | |
| 2 | 1930 | 14 Jul 1930 - 12:45 | Group 2 | Parque Central | Montevideo | Yugoslavia | 2.0 | 1.0 | Brazil | 24059.0 | 2.0 | 0.0 | TEJADA Anibal (URU) | VALLARINO Ricardo (URU) | BALWAY Thomas (FRA) | 201.0 | 1093.0 | YUG | BRA | |
| 3 | 1930 | 14 Jul 1930 - 14:50 | Group 3 | Pocitos | Montevideo | Romania | 3.0 | 1.0 | Peru | 2549.0 | 1.0 | 0.0 | WARNKEN Alberto (CHI) | LANGENUS Jean (BEL) | MATEUCCI Francisco (URU) | 201.0 | 1098.0 | ROU | PER | |
| 4 | 1930 | 15 Jul 1930 - 16:00 | Group 1 | Parque Central | Montevideo | Argentina | 1.0 | 0.0 | France | 23409.0 | 0.0 | 0.0 | REGO Gilberto (BRA) | SAUCEDO Ulises (BOL) | RADULESCU Constantin (ROU) | 201.0 | 1085.0 | ARG | FRA |
#First rows of Players dataset
players.head()
| RoundID | MatchID | Team Initials | Coach Name | Line-up | Shirt Number | Player Name | Position | Event | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 201 | 1096 | FRA | CAUDRON Raoul (FRA) | S | 0 | Alex THEPOT | GK | NaN |
| 1 | 201 | 1096 | MEX | LUQUE Juan (MEX) | S | 0 | Oscar BONFIGLIO | GK | NaN |
| 2 | 201 | 1096 | FRA | CAUDRON Raoul (FRA) | S | 0 | Marcel LANGILLER | NaN | G40' |
| 3 | 201 | 1096 | MEX | LUQUE Juan (MEX) | S | 0 | Juan CARRENO | NaN | G70' |
| 4 | 201 | 1096 | FRA | CAUDRON Raoul (FRA) | S | 0 | Ernest LIBERATI | NaN | NaN |
#Last rows of Matches dataset
players.tail()
| RoundID | MatchID | Team Initials | Coach Name | Line-up | Shirt Number | Player Name | Position | Event | |
|---|---|---|---|---|---|---|---|---|---|
| 37779 | 255959 | 300186501 | ARG | SABELLA Alejandro (ARG) | N | 19 | ALVAREZ | NaN | NaN |
| 37780 | 255959 | 300186501 | GER | LOEW Joachim (GER) | N | 6 | KHEDIRA | NaN | NaN |
| 37781 | 255959 | 300186501 | ARG | SABELLA Alejandro (ARG) | N | 20 | AGUERO | NaN | IH46' Y65' |
| 37782 | 255959 | 300186501 | GER | LOEW Joachim (GER) | N | 21 | MUSTAFI | NaN | NaN |
| 37783 | 255959 | 300186501 | ARG | SABELLA Alejandro (ARG) | N | 23 | BASANTA | NaN | NaN |
#NAs in the Players dataset
missing_values_players = players.isna().sum()
print(missing_values_players)
RoundID 0 MatchID 0 Team Initials 0 Coach Name 0 Line-up 0 Shirt Number 0 Player Name 0 Position 33641 Event 28715 dtype: int64
# Transform column names to lowercase and remove special characters (avoid problems when populating tables into the database)
players.columns = players.columns.str.replace(r'\W+', '_', regex=True).str.lower()
players.head()
| roundid | matchid | team_initials | coach_name | line_up | shirt_number | player_name | position | event | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 201 | 1096 | FRA | CAUDRON Raoul (FRA) | S | 0 | Alex THEPOT | GK | NaN |
| 1 | 201 | 1096 | MEX | LUQUE Juan (MEX) | S | 0 | Oscar BONFIGLIO | GK | NaN |
| 2 | 201 | 1096 | FRA | CAUDRON Raoul (FRA) | S | 0 | Marcel LANGILLER | NaN | G40' |
| 3 | 201 | 1096 | MEX | LUQUE Juan (MEX) | S | 0 | Juan CARRENO | NaN | G70' |
| 4 | 201 | 1096 | FRA | CAUDRON Raoul (FRA) | S | 0 | Ernest LIBERATI | NaN | NaN |
Retaining the 3 datasets (matches, palyers, world_cup) as separate entities is be a good starting point for an ER diagram in 3N. Each of these entities represents a distinct concept in the World Cup data, and their attributes are well-defined.
The relationships between these entities could be represented as follows:
A world cup has many matches (one-to-many relationship) and a Match has many players (many-to-one relationship).
#Import image
display(Image(filename='ER_diagram_FIFA.jpg'))
# Establish a connection to the PostgreSQL server
conn = psycopg2.connect(
host="localhost",
port='5432',
database="postgres",
user="postgres",
password="123"
)
# Set autocommit to true to create the database
conn.autocommit = True
# Create a new database
db_name = "fifa"
cur = conn.cursor()
cur.execute(f"CREATE DATABASE {db_name}")
print(f"{db_name} database has been successfully created in PostgreSQL.")
fifa database has been successfully created in PostgreSQL.
# Close the connection to the initial 'postgres' database
conn.close()
# Pass the connection string to a variable, conn_url
conn_url = 'postgresql://postgres:123@localhost:5432/fifa'
# Create an engine that connects to the new database
engine = create_engine(conn_url)
# Establish a connection
connection = engine.connect()
# Define SQL commands to create tables (without foreign key constraints)
stmt = (
"""
CREATE TABLE world_cup (
year integer PRIMARY KEY,
country text,
winner text,
runners_up text,
third text,
fourth text,
goalsscored integer,
qualifiedteams integer,
matchesplayed integer,
attendance integer
);
""",
"""
CREATE TABLE matches (
matchid integer PRIMARY KEY,
year integer,
roundid integer,
datetime text,
stage text,
stadium text,
city text,
home_team_name text,
home_team_goals integer,
away_team_goals integer,
away_team_name text,
win_conditions text,
attendance integer,
half_time_home_goals integer,
half_time_away_goals integer,
referee text,
assistant_1 text,
assistant_2 text,
home_team_initials text,
away_team_initials text,
FOREIGN KEY (year) REFERENCES world_cup (year)
);
""",
"""
CREATE TABLE players (
roundid integer,
matchid integer,
team_initials text,
coach_name text,
line_up text,
shirt_number integer,
player_name text,
position text,
event text,
FOREIGN KEY (matchid) REFERENCES matches (matchid)
);
"""
)
# Create a connection and cursor
with psycopg2.connect(database=db_name, user="postgres", password="123", host="localhost") as conn:
with conn.cursor() as cur:
for command in stmt:
cur.execute(command)
# commit the changes to the database
conn.commit()
print('The tables have been successfully created in PostgreSQL.')
The tables have been successfully created in PostgreSQL.
#Insert into world_cup table in database
world_cup.to_sql('world_cup', con=engine, index=False, if_exists='append')
20
#Insert into matches table in database
matches.to_sql('matches', con=engine, index=False, if_exists='append')
836
#Insert into players table in database
players.to_sql('players', con=engine, index=False, if_exists='append')
784
# SQL query to find the top 5 countries with the most World Cup wins
top5_wc = """
SELECT winner, COUNT(*) AS wins
FROM world_cup
WHERE winner IS NOT NULL
GROUP BY winner
ORDER BY wins DESC
LIMIT 5;
"""
# Execute the query and fetch the results into a Pandas DataFrame
top5_wc_df = pd.read_sql_query(top5_wc, conn)
print(top5_wc_df)
winner wins 0 Brazil 5 1 Italy 4 2 Germany FR 3 3 Argentina 2 4 Uruguay 2
/Users/pilarbifaretti/opt/anaconda3/lib/python3.9/site-packages/pandas/io/sql.py:762: UserWarning: pandas only support SQLAlchemy connectable(engine/connection) ordatabase string URI or sqlite3 DBAPI2 connectionother DBAPI2 objects are not tested, please consider using SQLAlchemy warnings.warn(
# SQL query to find the teams that have reached the finals the most times.
countries_finals_wc = """
SELECT
team_name,
COUNT(*) AS finals_appearances
FROM
(
SELECT
winner AS team_name
FROM
world_cup
UNION ALL
SELECT
runners_up AS team_name
FROM
world_cup
) AS finals_teams
WHERE
team_name IS NOT NULL
GROUP BY
team_name
ORDER BY
finals_appearances DESC
LIMIT 10;
"""
# Execute the query and fetch the results into a Pandas DataFrame
countries_finals_wc_df = pd.read_sql_query(countries_finals_wc, conn)
print(countries_finals_wc_df)
team_name finals_appearances 0 Brazil 7 1 Germany FR 6 2 Italy 6 3 Argentina 5 4 Netherlands 3 5 Czechoslovakia 2 6 France 2 7 Uruguay 2 8 Germany 2 9 Hungary 2
/Users/pilarbifaretti/opt/anaconda3/lib/python3.9/site-packages/pandas/io/sql.py:762: UserWarning: pandas only support SQLAlchemy connectable(engine/connection) ordatabase string URI or sqlite3 DBAPI2 connectionother DBAPI2 objects are not tested, please consider using SQLAlchemy warnings.warn(
# SQL query to find the teams that have reached the finals the most times.
finalist_wc = """
SELECT
country AS team_name,
SUM(CASE WHEN position = 'Winner' THEN 1 ELSE 0 END) AS Winner,
SUM(CASE WHEN position = 'Runners-Up' THEN 1 ELSE 0 END) AS Runners_Up,
SUM(CASE WHEN position = 'Third' THEN 1 ELSE 0 END) AS Third
FROM
(
SELECT
winner AS country,
'Winner' AS position
FROM
world_cup
UNION ALL
SELECT
runners_up AS country,
'Runners-Up' AS position
FROM
world_cup
UNION ALL
SELECT
third AS country,
'Third' AS position
FROM
world_cup
) AS teams
WHERE
country IS NOT NULL
GROUP BY
country
ORDER BY
Winner DESC, Runners_Up DESC, Third DESC;
"""
# Execute the query and fetch the results into a Pandas DataFrame
finalist_wc_df = pd.read_sql_query(finalist_wc, conn)
print(finalist_wc_df)
team_name winner runners_up third 0 Brazil 5 2 2 1 Italy 4 2 1 2 Germany FR 3 3 1 3 Argentina 2 3 0 4 Uruguay 2 0 0 5 Germany 1 1 3 6 France 1 1 2 7 Spain 1 0 0 8 England 1 0 0 9 Netherlands 0 3 1 10 Hungary 0 2 0 11 Czechoslovakia 0 2 0 12 Sweden 0 1 2 13 Poland 0 0 2 14 USA 0 0 1 15 Turkey 0 0 1 16 Chile 0 0 1 17 Austria 0 0 1 18 Croatia 0 0 1 19 Portugal 0 0 1
/Users/pilarbifaretti/opt/anaconda3/lib/python3.9/site-packages/pandas/io/sql.py:762: UserWarning: pandas only support SQLAlchemy connectable(engine/connection) ordatabase string URI or sqlite3 DBAPI2 connectionother DBAPI2 objects are not tested, please consider using SQLAlchemy warnings.warn(
# SQL query to find the average number of goals scored per World Cup edition.
goals_wc = """
SELECT year, country AS host_country, AVG(goalsscored) AS avg_goals
FROM world_cup
GROUP BY year
ORDER BY year;
"""
# Execute the query and fetch the results into a Pandas DataFrame
goals_wc_df = pd.read_sql_query(goals_wc, conn)
print(goals_wc_df)
year host_country avg_goals 0 1930 Uruguay 70.0 1 1934 Italy 70.0 2 1938 France 84.0 3 1950 Brazil 88.0 4 1954 Switzerland 140.0 5 1958 Sweden 126.0 6 1962 Chile 89.0 7 1966 England 89.0 8 1970 Mexico 95.0 9 1974 Germany 97.0 10 1978 Argentina 102.0 11 1982 Spain 146.0 12 1986 Mexico 132.0 13 1990 Italy 115.0 14 1994 USA 141.0 15 1998 France 171.0 16 2002 Korea/Japan 161.0 17 2006 Germany 147.0 18 2010 South Africa 145.0 19 2014 Brazil 171.0
/Users/pilarbifaretti/opt/anaconda3/lib/python3.9/site-packages/pandas/io/sql.py:762: UserWarning: pandas only support SQLAlchemy connectable(engine/connection) ordatabase string URI or sqlite3 DBAPI2 connectionother DBAPI2 objects are not tested, please consider using SQLAlchemy warnings.warn(
# SQL query to find the host country that had the highest percentage of matches won.
top_host_wc = """
SELECT country,
ROUND(SUM(CASE WHEN winner = home_team_name THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 2) AS percentage_won
FROM world_cup
JOIN matches ON world_cup.year = matches.year
GROUP BY country
ORDER BY percentage_won DESC
LIMIT 1;
"""
# Execute the query and fetch the results into a Pandas DataFrame
top_host_wc_df = pd.read_sql_query(top_host_wc, conn)
print(top_host_wc_df)
country percentage_won 0 Uruguay 22.22
/Users/pilarbifaretti/opt/anaconda3/lib/python3.9/site-packages/pandas/io/sql.py:762: UserWarning: pandas only support SQLAlchemy connectable(engine/connection) ordatabase string URI or sqlite3 DBAPI2 connectionother DBAPI2 objects are not tested, please consider using SQLAlchemy warnings.warn(
# SQL query to find the correlation between the number of qualified teams and the total goals scored.
qualified_goals_wc = """
SELECT CORR(qualifiedteams, goalsscored) AS correlation
FROM world_cup;
"""
# Execute the query and fetch the results into a Pandas DataFrame
qualified_goals_wc_df = pd.read_sql_query(qualified_goals_wc, conn)
print(qualified_goals_wc_df)
correlation 0 0.866201
/Users/pilarbifaretti/opt/anaconda3/lib/python3.9/site-packages/pandas/io/sql.py:762: UserWarning: pandas only support SQLAlchemy connectable(engine/connection) ordatabase string URI or sqlite3 DBAPI2 connectionother DBAPI2 objects are not tested, please consider using SQLAlchemy warnings.warn(
# SQL query to find the average Goals by Stage
goals_stage_wc = """
SELECT stage,
AVG(home_team_goals + away_team_goals) AS avg_goals
FROM matches
GROUP BY stage
ORDER BY avg_goals DESC;
"""
# Execute the query and fetch the results into a Pandas DataFrame
goals_stage_wc_df = pd.read_sql_query(goals_stage_wc, conn)
print(goals_stage_wc_df)
stage avg_goals 0 Preliminary round 5.375000 1 First round 4.888889 2 Group 6 4.750000 3 Third place 4.500000 4 Match for third place 3.933333 5 Semi-finals 3.617647 6 Final 3.578947 7 Group 4 3.200000 8 Group 2 3.169492 9 Play-off for third place 3.000000 10 Quarter-finals 2.822581 11 Group B 2.816667 12 Group 1 2.806452 13 Group 3 2.750000 14 Group A 2.666667 15 Group C 2.625000 16 Group D 2.562500 17 Round of 16 2.562500 18 Group E 2.520833 19 Group H 2.400000 20 Group G 2.333333 21 Group F 1.937500 22 Group 5 1.500000
/Users/pilarbifaretti/opt/anaconda3/lib/python3.9/site-packages/pandas/io/sql.py:762: UserWarning: pandas only support SQLAlchemy connectable(engine/connection) ordatabase string URI or sqlite3 DBAPI2 connectionother DBAPI2 objects are not tested, please consider using SQLAlchemy warnings.warn(
# SQL query to find the cities that have hosted the most matches.
city_matches_wc = """
SELECT city, COUNT(*) AS num_matches_hosted
FROM matches
GROUP BY city
ORDER BY num_matches_hosted DESC
LIMIT 10;
"""
# Execute the query and fetch the results into a Pandas DataFrame
city_matches_wc_df = pd.read_sql_query(city_matches_wc, conn)
print(city_matches_wc_df)
city num_matches_hosted 0 Mexico City 23 1 Montevideo 18 2 Guadalajara 17 3 Rio De Janeiro 15 4 Johannesburg 15 5 Buenos Aires 12 6 Sao Paulo 12 7 Munich 11 8 Leon 11 9 Frankfurt/Main 10
/Users/pilarbifaretti/opt/anaconda3/lib/python3.9/site-packages/pandas/io/sql.py:762: UserWarning: pandas only support SQLAlchemy connectable(engine/connection) ordatabase string URI or sqlite3 DBAPI2 connectionother DBAPI2 objects are not tested, please consider using SQLAlchemy warnings.warn(
# SQL query to find the World Cup edition with the highest and lowest attendance.
highest_attendance_query = """
SELECT
'Highest Attendance' AS type,
wc.year,
wc.attendance,
wc.country AS host_country
FROM
world_cup wc
ORDER BY
wc.attendance DESC
LIMIT 1;
"""
lowest_attendance_query = """
SELECT
'Lowest Attendance' AS type,
wc.year,
wc.attendance,
wc.country AS host_country
FROM
world_cup wc
WHERE
wc.attendance > 0
ORDER BY
wc.attendance
LIMIT 1;
"""
# Execute the queries and fetch the results into Pandas DataFrames
highest_attendance_df = pd.read_sql_query(highest_attendance_query, conn)
lowest_attendance_df = pd.read_sql_query(lowest_attendance_query, conn)
# Merge the results into a single DataFrame
attendance_df = pd.concat([highest_attendance_df, lowest_attendance_df], ignore_index=True)
# Display the combined result
print(attendance_df)
type year attendance host_country 0 Highest Attendance 1994 3587538 USA 1 Lowest Attendance 1934 363000 Italy
/Users/pilarbifaretti/opt/anaconda3/lib/python3.9/site-packages/pandas/io/sql.py:762: UserWarning: pandas only support SQLAlchemy connectable(engine/connection) ordatabase string URI or sqlite3 DBAPI2 connectionother DBAPI2 objects are not tested, please consider using SQLAlchemy warnings.warn( /Users/pilarbifaretti/opt/anaconda3/lib/python3.9/site-packages/pandas/io/sql.py:762: UserWarning: pandas only support SQLAlchemy connectable(engine/connection) ordatabase string URI or sqlite3 DBAPI2 connectionother DBAPI2 objects are not tested, please consider using SQLAlchemy warnings.warn(
# SQL query to find the average attendance for matches in each stage.
attendance_matches_wc = """
SELECT stage,
AVG(attendance) AS avg_attendance
FROM matches
GROUP BY stage
ORDER BY avg_attendance DESC;
"""
# Execute the query and fetch the results into a Pandas DataFrame
attendance_matches_wc_df = pd.read_sql_query(attendance_matches_wc, conn)
print(attendance_matches_wc_df)
stage avg_attendance 0 Final 76470.263158 1 Play-off for third place 68034.000000 2 Group 6 65658.583333 3 Semi-finals 58956.235294 4 Third place 57741.500000 5 Group A 54321.350000 6 Round of 16 51592.666667 7 Group B 51367.866667 8 Match for third place 50847.866667 9 Group G 48533.166667 10 Group H 46747.033333 11 Group E 45675.916667 12 Group C 45514.937500 13 Group D 45427.333333 14 Quarter-finals 44543.903226 15 Group F 42840.875000 16 Group 1 41664.064516 17 Group 5 35354.000000 18 Group 3 34271.410714 19 Group 2 34241.762712 20 Group 4 25915.745455 21 Preliminary round 16875.000000 22 First round 16120.333333
/Users/pilarbifaretti/opt/anaconda3/lib/python3.9/site-packages/pandas/io/sql.py:762: UserWarning: pandas only support SQLAlchemy connectable(engine/connection) ordatabase string URI or sqlite3 DBAPI2 connectionother DBAPI2 objects are not tested, please consider using SQLAlchemy warnings.warn(
# SQL query to find the
matches_goals_wc = """
SELECT
m.matchid,
m.year,
m.stage,
m.attendance,
m.home_team_goals AS home_goals,
m.away_team_goals AS away_goals
FROM
matches AS m;
"""
# Execute the query and fetch the results into a Pandas DataFrame
matches_goals_wc_df = pd.read_sql_query(matches_goals_wc, conn)
print(matches_goals_wc_df)
matchid year stage attendance home_goals away_goals 0 1096 1930 Group 1 4444.0 4 1 1 1090 1930 Group 4 18346.0 3 0 2 1093 1930 Group 2 24059.0 2 1 3 1098 1930 Group 3 2549.0 3 1 4 1085 1930 Group 1 23409.0 1 0 .. ... ... ... ... ... ... 831 300186504 2014 Quarter-finals 68551.0 1 0 832 300186508 2014 Round of 16 58817.0 2 1 833 300186459 2014 Round of 16 41242.0 1 1 834 300186503 2014 Round of 16 63255.0 1 0 835 300186497 2014 Round of 16 51227.0 2 1 [836 rows x 6 columns]
/Users/pilarbifaretti/opt/anaconda3/lib/python3.9/site-packages/pandas/io/sql.py:762: UserWarning: pandas only support SQLAlchemy connectable(engine/connection) ordatabase string URI or sqlite3 DBAPI2 connectionother DBAPI2 objects are not tested, please consider using SQLAlchemy warnings.warn(
# SQL query to find the referee who has officiated the most matches.
referee_matches_wc = """
SELECT referee,
COUNT(*) AS matches_officiated
FROM matches
GROUP BY referee
ORDER BY matches_officiated DESC
LIMIT 1;
"""
# Execute the query and fetch the results into a Pandas DataFrame
referee_matches_wc_df = pd.read_sql_query(referee_matches_wc, conn)
print(referee_matches_wc_df)
referee matches_officiated 0 Ravshan IRMATOV (UZB) 9
/Users/pilarbifaretti/opt/anaconda3/lib/python3.9/site-packages/pandas/io/sql.py:762: UserWarning: pandas only support SQLAlchemy connectable(engine/connection) ordatabase string URI or sqlite3 DBAPI2 connectionother DBAPI2 objects are not tested, please consider using SQLAlchemy warnings.warn(
# SQL query to find the average number of goals scored in matches where Ravshan IRMATOV officiated
irmatov_matches_wc = """
SELECT AVG(home_team_goals + away_team_goals) AS avg_goals
FROM matches
WHERE referee = 'Ravshan IRMATOV (UZB)';
"""
# Execute the query and fetch the results into a Pandas DataFrame
irmatov_matches_wc_df = pd.read_sql_query(irmatov_matches_wc, conn)
print(irmatov_matches_wc_df)
avg_goals 0 2.333333
/Users/pilarbifaretti/opt/anaconda3/lib/python3.9/site-packages/pandas/io/sql.py:762: UserWarning: pandas only support SQLAlchemy connectable(engine/connection) ordatabase string URI or sqlite3 DBAPI2 connectionother DBAPI2 objects are not tested, please consider using SQLAlchemy warnings.warn(
# SQL query to find the World Cup edition with the highest average goals per match.
most_avg_goals_wc = """
SELECT year, AVG(home_team_goals + away_team_goals) AS avg_goals_per_match
FROM matches
GROUP BY year
ORDER BY avg_goals_per_match DESC
LIMIT 1;
"""
# Execute the query and fetch the results into a Pandas DataFrame
most_avg_goals_wc_df = pd.read_sql_query(most_avg_goals_wc, conn)
print(most_avg_goals_wc_df)
year avg_goals_per_match 0 1954 5.384615
/Users/pilarbifaretti/opt/anaconda3/lib/python3.9/site-packages/pandas/io/sql.py:762: UserWarning: pandas only support SQLAlchemy connectable(engine/connection) ordatabase string URI or sqlite3 DBAPI2 connectionother DBAPI2 objects are not tested, please consider using SQLAlchemy warnings.warn(
# SQL query to find the average number of goals scored by home teams and away teams.
goals_HT_AT_wc = """
SELECT AVG(home_team_goals) AS avg_home_goals,
AVG(away_team_goals) AS avg_away_goals
FROM matches;
"""
# Execute the query and fetch the results into a Pandas DataFrame
goals_HT_AT_wc_df = pd.read_sql_query(goals_HT_AT_wc, conn)
print(goals_HT_AT_wc_df)
avg_home_goals avg_away_goals 0 1.824163 1.021531
/Users/pilarbifaretti/opt/anaconda3/lib/python3.9/site-packages/pandas/io/sql.py:762: UserWarning: pandas only support SQLAlchemy connectable(engine/connection) ordatabase string URI or sqlite3 DBAPI2 connectionother DBAPI2 objects are not tested, please consider using SQLAlchemy warnings.warn(
# SQL query to find the top 10 players who have served as captains the most number of times.
top_captains_wc = """
SELECT
player_name,
team_initials AS country,
COUNT(*) AS captaincy_count
FROM
players
WHERE
position = 'C'
GROUP BY
player_name, team_initials
ORDER BY
captaincy_count DESC
LIMIT 10;
"""
# Execute the query and fetch the results into a Pandas DataFrame
top_captains_wc_df = pd.read_sql_query(top_captains_wc, conn)
print(top_captains_wc_df)
player_name country captaincy_count 0 LAHM GER 17 1 Diego MARADONA ARG 16 2 Kazimierz DEYNA POL 13 3 R. MARQUEZ MEX 13 4 Daniel PASSARELLA ARG 12 5 Ladislav NOVAK TCH 12 6 MESSI ARG 12 7 Uwe SEELER FRG 12 8 Michel PLATINI FRA 11 9 DUNGA BRA 11
/Users/pilarbifaretti/opt/anaconda3/lib/python3.9/site-packages/pandas/io/sql.py:762: UserWarning: pandas only support SQLAlchemy connectable(engine/connection) ordatabase string URI or sqlite3 DBAPI2 connectionother DBAPI2 objects are not tested, please consider using SQLAlchemy warnings.warn(
# SQL query to find the Top 5 players with the most goals across all World Cups.
top_players_wc = """
SELECT
p.player_name,
m.home_team_name AS player_country,
COUNT(*) AS goal_count
FROM
players AS p
JOIN
matches AS m ON p.matchid = m.matchid
WHERE
p.event LIKE '%G%'
GROUP BY
p.player_name, m.home_team_name
ORDER BY
goal_count DESC
LIMIT 5;
"""
# Execute the query and fetch the results into a Pandas DataFrame
top_players_wc_df = pd.read_sql_query(top_players_wc, conn)
print(top_players_wc_df)
player_name player_country goal_count 0 RONALDO Brazil 8 1 PEL� (Edson Arantes do Nascimento) Brazil 8 2 Uwe SEELER Germany FR 8 3 KLOSE Germany 7 4 JAIRZINHO Brazil 6
/Users/pilarbifaretti/opt/anaconda3/lib/python3.9/site-packages/pandas/io/sql.py:762: UserWarning: pandas only support SQLAlchemy connectable(engine/connection) ordatabase string URI or sqlite3 DBAPI2 connectionother DBAPI2 objects are not tested, please consider using SQLAlchemy warnings.warn(
# SQL query to find the World Cup edition with the highest percentage of away team wins.
at_wins_wc = """
SELECT
wc.year,
(COUNT(CASE WHEN m.away_team_goals > m.home_team_goals THEN 1 END)::float / COUNT(*)) * 100 AS away_win_percentage
FROM
world_cup wc
JOIN
matches m ON wc.year = m.year
GROUP BY
wc.year
ORDER BY
away_win_percentage DESC
LIMIT 1;
"""
# Execute the query and fetch the results into a Pandas DataFrame
at_wins_wc_df = pd.read_sql_query(at_wins_wc, conn)
print(at_wins_wc_df)
year away_win_percentage 0 2010 39.0625
/Users/pilarbifaretti/opt/anaconda3/lib/python3.9/site-packages/pandas/io/sql.py:762: UserWarning: pandas only support SQLAlchemy connectable(engine/connection) ordatabase string URI or sqlite3 DBAPI2 connectionother DBAPI2 objects are not tested, please consider using SQLAlchemy warnings.warn(
fig1 = px.bar(
top5_wc_df,
x='winner',
y='wins',
title='Top 5 Countries with the Most World Cup Wins (up to 2014)',
color='winner',
color_discrete_sequence=px.colors.qualitative.Prism
)
fig1.update_xaxes(title_text='Country')
fig1.update_yaxes(title_text='Number of Wins')
fig1.show()
fig2 = px.bar(
countries_finals_wc_df,
x='team_name',
y='finals_appearances',
title='Teams That Have Reached the Finals the Most Times',
color='team_name', # Color bars by team
color_discrete_sequence=px.colors.qualitative.Prism
)
fig2.update_xaxes(title_text='Team')
fig2.update_yaxes(title_text='Finals Appearances')
fig2.show()
fig3 = px.bar(
finalist_wc_df,
x='team_name',
y=['winner', 'runners_up', 'third'],
title='Number of Times Teams Have Reached the Top 3 Places',
labels={'value': 'Count', 'variable': 'Position'},
color_discrete_sequence=px.colors.qualitative.Prism
)
fig3.update_xaxes(title_text='Team')
fig3.update_yaxes(title_text='Count')
fig3.show()
fig4 = px.line(
goals_wc_df,
x='year',
y='avg_goals',
title='Average Goals Scored per World Cup Edition',
color_discrete_sequence=px.colors.qualitative.Prism,
hover_name='host_country' ,
labels={'avg_goals': 'Average Goals'}
)
fig4.update_xaxes(title_text='Year')
fig4.update_yaxes(title_text='Average Goals')
fig4.show()
fig5 = px.bar(
top_players_wc_df,
x='player_name',
y='goal_count',
title='Top 5 Players with the Most Goals Across All World Cups',
color='player_country',
color_discrete_sequence=px.colors.qualitative.Prism
)
fig5.update_xaxes(title_text='Player')
fig5.update_yaxes(title_text='Goal Count')
fig5.show()
fig6 = px.bar(
attendance_df,
x='year',
y='attendance',
color='type', # Color bars by type (highest/lowest)
title='World Cup Edition with Highest and Lowest Attendance',
color_discrete_sequence=px.colors.qualitative.Prism,
hover_data=['host_country'],
labels={'host_country': 'country'}
)
fig6.update_xaxes(title_text='Year')
fig6.update_yaxes(title_text='Attendance')
fig6.show()
fig7 = px.bar(
attendance_matches_wc_df,
x='stage',
y='avg_attendance',
title='Average Attendance for Matches in Each Stage',
color='stage', # Color bars by stage
color_discrete_sequence=px.colors.qualitative.Prism
)
fig7.update_xaxes(title_text='Stage')
fig7.update_yaxes(title_text='Average Attendance')
fig7.show()
fig8 = px.bar(
top_captains_wc_df,
x='player_name',
y='captaincy_count',
title='Players Who Have Served as Captains the Most Number of Times',
color='country',
color_discrete_sequence=px.colors.qualitative.Prism,
)
fig8.update_xaxes(title_text='Player')
fig8.update_yaxes(title_text='Captaincy Count')
fig8.show()
fig9 = px.bar(
city_matches_wc_df,
x='city',
y='num_matches_hosted',
title='Top 10 Cities That Have Hosted the Most Matches',
color='city',
color_discrete_sequence=px.colors.qualitative.Prism)
fig9.update_xaxes(title_text='City')
fig9.update_yaxes(title_text='Number of Matches Hosted')
fig9.show()
# Add a new column 'total_goals' which is the sum of home and away goals
matches_goals_wc_df['total_goals'] = matches_goals_wc_df['home_goals'] + matches_goals_wc_df['away_goals']
# Create the scatter plot with the new 'total_goals' column and Prism color palette
fig10 = px.scatter(matches_goals_wc_df, x="attendance", y="total_goals",
title="Attendance vs. Total Goals Scored for Each Match",
color_discrete_sequence=px.colors.qualitative.Prism)
# Customize the layout
fig10.update_layout(
xaxis_title="Attendance",
yaxis_title="Total Goals Scored",
hovermode="closest"
)
# Show the plot
fig10.show()
#Export some of the graphs (for blog post)
# Export fig1 as HTML
pio.write_html(fig1, file='fig1.html', auto_open=False)
# Export fig3 as HTML
pio.write_html(fig3, file='fig3.html', auto_open=False)
# Export fig4 as HTML
pio.write_html(fig4, file='fig4.html', auto_open=False)
# Export fig5 as HTML
pio.write_html(fig5, file='fig5.html', auto_open=False)
# Export fig8 as HTML
pio.write_html(fig8, file='fig8.html', auto_open=False)
# Export fig9 as HTML
pio.write_html(fig9, file='fig9.html', auto_open=False)
The FIFA World Cup, a pinnacle of international football, has been a spectacle since its inception in 1930. Delving into a comprehensive dataset of World Cup history (until 2014), the following insights were uncovered:
In the end, the FIFA World Cup is not just a sporting event; it's a story of nations, passion, and the beautiful game. Each statistic and record is a chapter in this enthralling narrative, weaving a tapestry of football's most illustrious competition. As the world unites every four years to celebrate football, these insights remind us of the rich heritage and unpredictability that make the World Cup a timeless spectacle.
Close connection and engine to avoid resource leaks.
# Close the connection
connection.close()
# Close the engine
engine.dispose()